This study explores the application of machine learning algorithms to predict experimental success from neural activity data. I employed XGBoost, Logistic Regression, and Random Forest to model complex relationships within the data and predict experimental outcomes. The analysis evaluates model performance using cross-validation and confusion matrix analyses. The results highlight the effectiveness of all three algorithms in predicting experimental success and offer insights for future work with such neurological data.
In this study, I analyzed neural activity data collected from experimental sessions in the study previously conducted by Nicholas Steinmetz, with the aim of gaining potential insights. The features in the “sessions” data included feedback type, left and right contrasts, the brain region recorded in each trial, the spike rates, and the names of the mice on whose brains the study was conducted.
My primary objective was to analyze and interpret the complex relationships within the data, with a particular focus on success rates based on the average spike rates across the 40 given time bins.
I began by pre-processing the data, then performed Exploratory Data Analysis to ensure robust evaluation of the models. Subsequently, I explored three model algorithms, namely XGBoost, Logistic Regression, and Random Forest, to model the relationship between experimental success and input features including feedback type, left and right contrasts, and average spikes across time bins.
Using XGBoost, I trained models on the integrated datasets created from my findings during the data analysis. Through performance-assessment techniques such as cross-validation, confusion matrix analysis, and accuracy metrics, I assessed the performance of my XGBoost models and detailed their predictive capabilities.
Second, I incorporated Logistic Regression in order to compare its performance to that of XGBoost. Similarly, with the help of accuracy metrics, I assessed its performance on two datasets that emphasized the first 20 and last 20 time bins respectively.
Finally, I furthered my analysis with Random Forest, leveraging features such as feedback type as splits in the decision trees, and explored its potential for capturing patterns within the data and making accurate predictions that facilitate understanding of the data.
After loading the sessions data, I performed a few preliminary steps to ensure my data was organized into tibbles that would make the exploratory data analysis process more seamless. First, I defined two functions that would help me retrieve information about specific trials,
## # A tibble: 6 × 8
## brain_area region_sum_spike region_count region_mean_spike trial_id
## <chr> <dbl> <int> <dbl> <dbl>
## 1 ACA 64 109 0.587 2
## 2 CA3 116 68 1.71 2
## 3 DG 53 34 1.56 2
## 4 LS 190 139 1.37 2
## 5 MOs 88 113 0.779 2
## 6 SUB 146 75 1.95 2
## # ℹ 3 more variables: contrast_left <dbl>, contrast_right <dbl>,
## # feedback_type <dbl>
as well as specific trials within a needed session:
## # A tibble: 6 × 11
## brain_area region_sum_spike region_count region_mean_spike trial_id
## <chr> <dbl> <int> <dbl> <int>
## 1 ACA 138 109 1.27 1
## 2 CA3 104 68 1.53 1
## 3 DG 79 34 2.32 1
## 4 LS 269 139 1.94 1
## 5 MOs 106 113 0.938 1
## 6 SUB 156 75 2.08 1
## # ℹ 6 more variables: contrast_left <dbl>, contrast_right <dbl>,
## # feedback_type <dbl>, mouse_name <chr>, date_exp <chr>, session_id <dbl>
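A minimal sketch of how such a per-trial summary helper might look. The function name `summarize_trial` and the session fields (`$spks`, `$brain_area`, `$contrast_left`, `$contrast_right`, `$feedback_type`) are assumptions based on the Steinmetz session structure, not the exact code used here.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: summarize one trial's spikes by brain region.
summarize_trial <- function(session, trial_id) {
  spks <- session$spks[[trial_id]]        # neurons x time-bin spike matrix
  tibble(
    brain_area = session$brain_area,
    spikes     = rowSums(spks)            # total spikes per neuron this trial
  ) |>
    group_by(brain_area) |>
    summarise(
      region_sum_spike  = sum(spikes),
      region_count      = n(),
      region_mean_spike = mean(spikes),
      .groups = "drop"
    ) |>
    mutate(
      trial_id       = trial_id,
      contrast_left  = session$contrast_left[trial_id],
      contrast_right = session$contrast_right[trial_id],
      feedback_type  = session$feedback_type[trial_id]
    )
}
```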
I then stored information about all trials from all 18 sessions in a data frame, splitting time into 40 bins so that I could analyze the average spike rates across all time bins. To make visualizing feedback type easier, I filtered out any feedback of 0 and created a column called success that flags the successes in the data (where feedback is 1).
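The binning and the success flag described above might be sketched like this; the 40-column spike matrix layout and the column names are assumptions, not the report's actual code.

```r
# Average spikes per time bin for one trial: mean over neurons per column.
trial_bin_averages <- function(spks) {
  stopifnot(ncol(spks) == 40)             # assume 40 time-bin columns
  avg <- colMeans(spks)                   # mean spikes per neuron, per bin
  setNames(avg, paste0("bin_", seq_len(40)))
}

# Feedback of 1 is a success; rows with feedback 0 would be dropped first.
add_success <- function(df) {
  df$success <- as.integer(df$feedback_type == 1)
  df
}
```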
I then explored the data further to identify patterns and interactions between features. My first instinct was to get an understanding of the total number of neurons in each session, and to see which of the mice had the highest success rate; Lederberg proved to have the highest success rate based on feedback type. I also wanted to know which focused brain area yielded the highest average spike rates: RN yielded the highest average spike rate, and further exploration showed that session 13 incorporated RN the most. The next graph I visualized was the average spikes of all sessions across the trials:
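One of the EDA summaries above (success rate per mouse) could be sketched as follows, assuming a long-format trial table with `mouse_name` and `success` columns (both names are assumptions):

```r
library(dplyr)

# Success rate per mouse, highest first.
success_by_mouse <- function(all_trials) {
  all_trials |>
    group_by(mouse_name) |>
    summarise(success_rate = mean(success), .groups = "drop") |>
    arrange(desc(success_rate))
}
```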
Observing Avg Spike Rates across trials of all sessions

I observed that the sessions with a little over 1000 neurons per trial (sessions 4, 5, 6, and 13, to name a few) had little variation in their average spike rates, whereas session 11, with fewer than 1000 neurons per trial, displayed very evident fluctuations. I wanted a visualization that focused more on the time bins and their individual success rates, so I created a bar chart. Upon looking at the graph, we notice that the largest number of failures (reminder that success is based on the feedback type feature) occurs in the 10th bin. Compared to the rate of failures in bins 1 through 20, the last 20 bins do not display as many failures. I also noticed peaks at bins 1 and 38, so I believed it would be insightful to see whether performance would change if I split the data into subsets of bins 1-20 and 21-40 (since the two peaks of average spike rate would then fall in separate sub-datasets).
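The bin split motivated above might be sketched like this, assuming bin-average columns named `bin_1` through `bin_40` plus the `success` outcome (names are assumptions):

```r
# Split the trial table into two sub-datasets: first 20 and last 20 bins,
# each keeping the success outcome for modeling.
split_by_bins <- function(df) {
  first_cols <- paste0("bin_", 1:20)
  last_cols  <- paste0("bin_", 21:40)
  list(
    first20 = df[, c(first_cols, "success")],
    last20  = df[, c(last_cols,  "success")]
  )
}
```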
Since I believed that Principal Component Analysis (PCA) could help reduce the dimensionality of the large dataset, letting me take a closer look at the information and interactions within clusters for each session, I created the following visualization:
As one can notice through the comparison of PCs 1 and 2, most of the sessions are similar since the clusters overlap and are very close to each other in distance (suggesting similar PC1 and PC2 results), but sessions 13 and 3 seem to be slightly different from the rest of the sessions. As I had observed through the graph from the EDA section, session 11 displays the most spread of plot points, possibly due to the constant fluctuations in average spike rate across trials.
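A minimal sketch of the PCA step along these lines, using base R's `prcomp` on the scaled bin-average columns and keeping PC1/PC2 per session for plotting (the column layout is an assumption):

```r
# Project the bin-average features onto the first two principal components.
pca_scores <- function(df, bin_cols) {
  p <- prcomp(df[, bin_cols], center = TRUE, scale. = TRUE)
  data.frame(PC1 = p$x[, 1], PC2 = p$x[, 2], session_id = df$session_id)
}
```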
Although I had initially considered deleting sessions based on the cluster differences in the PCA visualization, I thought it essential to evaluate the larger implications of the data and consider alternative methods to ensure that the analysis remained informative throughout the predictive modeling process. All 18 sessions seemed to be of similar quality, so removing any of them would not substantially improve data quality; moreover, I did not want the removal of sessions to bias the data. Instead, I decided to standardize the average spike rates across bins to give the data a more uniform distribution. After reviewing the findings from EDA, I also decided to split my data into two datasets to put a larger emphasis on the question I had in mind: do the last 20 time bins display better performance on average than the first 20?
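The standardization step could be sketched as below, again assuming bin-average columns named `bin_1` through `bin_40`; each column is centered and scaled to unit variance with base R's `scale`:

```r
# Standardize each bin-average column (mean 0, sd 1).
standardize_bins <- function(df, bin_cols = paste0("bin_", 1:40)) {
  df[bin_cols] <- scale(df[bin_cols])
  df
}
```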
Using the data I already had in hand, I split 80% of the data into a training set and 20% into a test set.
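One way to sketch that 80/20 split is plain random sampling, as below; the report may instead have used a helper such as caret's `createDataPartition`, and the seed value here is arbitrary.

```r
# Randomly hold out 20% of rows as a test set.
train_test_split <- function(df, train_frac = 0.8, seed = 141) {
  set.seed(seed)
  idx <- sample(seq_len(nrow(df)), size = floor(train_frac * nrow(df)))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}
```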
I incorporated three models to test overall performance on previously unseen data: XGBoost, Logistic Regression, and Random Forest, in that order. For all three models, I built functions that take in either of the two datasets I created previously (one consisting of the first 20 time bins and the other of the last 20). I first created two XGBoost models, using accuracy score as the criterion for evaluating performance; I also printed the training loss for the models. The training loss did not start with a suspiciously small value, avoiding a possible indicator of overfitting. Looking at the accuracy, the first 20 bins performed slightly worse than the last 20 bins, with accuracy scores of 0.722 and 0.726 respectively.
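The accuracy and confusion-matrix evaluation mentioned above might look like the following sketch: predicted probabilities are thresholded at 0.5 and tabulated against the true labels (the function name and threshold are assumptions).

```r
# Confusion matrix and accuracy for a binary classifier's probabilities.
evaluate_classifier <- function(prob, truth, threshold = 0.5) {
  pred <- as.integer(prob >= threshold)
  cm <- table(predicted = factor(pred, levels = c(0, 1)),
              actual    = factor(truth, levels = c(0, 1)))
  list(confusion = cm, accuracy = mean(pred == truth))
}
```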
The next model I trained and tested was Logistic Regression. Here, I again used accuracy scores to evaluate performance on the first 20 bins and last 20 bins separately. I was surprised by the results, since the last 20 bins displayed poorer performance than the first 20: the accuracy scores were 0.721 for the first 20 bins and 0.680 for the last 20 bins.
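A minimal, self-contained logistic-regression sketch with base R's `glm` on synthetic data (the actual models used the bin-average features described above, so the data here is illustrative only):

```r
set.seed(141)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))      # binary outcome depending on x

fit  <- glm(y ~ x, family = binomial()) # logistic regression
prob <- predict(fit, type = "response") # predicted probabilities
acc  <- mean(as.integer(prob >= 0.5) == y)
```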
Lastly, I used Random Forest to test performance on the two sections of the larger dataset. I used feedback type as a split in the ensemble of trees, and used RMSE as the metric criterion for evaluating performance, since accuracy scores are not always the best way to evaluate random forest models. As I had envisioned, the last 20 bins performed better, with an RMSE value of 0.838, indicating a lower error rate. Overall, I chose the random forest model as the best model for evaluating such data because RMSE provides a more nuanced evaluation, capturing the magnitude of the errors. This allows a more precise assessment of model performance and aptly reflects the underlying relationships between features.
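For reference, the RMSE criterion used here is root mean squared error, computable in one line:

```r
# Root mean squared error between predictions and actual values.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
```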
I was able to successfully load the test sets and run them through my Logistic Regression and Random Forest models. I used the accuracy criterion for Logistic Regression because, given the binary nature of the outcome, accuracy keeps the evaluation of model performance as balanced as possible. It reflects the model’s ability to classify correctly without obvious bias, helps in optimizing the model, and somewhat reduces misclassification errors. When tested on the model with the first 20 bins, the accuracy was approximately 0.706, while the model with the last 20 bins displayed an accuracy of 0.674.
For Random Forest, I used the RMSE criterion to evaluate model performance because I believed it was best for optimizing the random forest models; since RMSE squares the errors before taking the root, it weights larger errors more heavily and gives a clear view of the magnitude of mistakes on noisy target variables. For the model with the first 20 bins, I got an RMSE value of 0.848, while for the model with the last 20 bins, the RMSE value was 0.839.
Lastly, for the XGBoost model, I again chose accuracy scores as my criterion for evaluating model performance, for reasons similar to those for logistic regression. The accuracy was as I had expected: the first 20 bins had an accuracy score of about 0.721, while the model with the last 20 bins displayed an accuracy score of 0.726.
Overall, the test set performance was not up to my expectations. Although XGBoost turned out to be the best model on the test set, I had expected that, just like the findings from the performance on the sessions data, the general trend would be for the last 20 bins to display better performance than the first 20. Potential challenges and limitations in the evaluation include the scale and distribution of the data across models affecting their performance: although I had standardized the sessions data, that would not have been the case with the testing data. Moreover, I may have slightly overfit the data while training the models, so in the future, measures against overfitting such as regularization could be beneficial.